
Hadoop/Big Data Developer Resume


Durham, NC

SUMMARY

  • Around 7 years of professional IT experience, with an emphasis on Big Data technologies and the design and development of Scala-based enterprise applications.
  • Hands-on experience in Hadoop ecosystem components like HDFS, Map Reduce, Yarn, Hive, Spark, Kafka, Sqoop, Oozie.
  • Experience using Hadoop clusters on Cloudera CDH and Hortonworks HDP.
  • Worked with AWS-based data ingestion and transformation, setting up data in AWS using S3 buckets and configuring instance backups to S3.
  • Good knowledge of Amazon EMR, EC2, Auto Scaling, Lambda, and S3 buckets.
  • Ingested data into Snowflake cloud data warehouse.
  • Knowledge in working with Azure cloud platform (Data Lake, Databricks, Blob Storage, Data Factory)
  • Familiar with Data Extraction tools and ETL tools like Abinitio, Informatica, Talend, Pentaho.
  • Designed HIVE queries to perform data analysis, data transfer, and table design.
  • Experience in analyzing data using HiveQL and extending HIVE core functionality by using custom UDFs.
  • Tested Apache TEZ, an extensible framework for building high-performance batch and interactive data processing applications, on Hive jobs.
  • Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer, Auto Scaling groups, and the AWS CLI.
  • Ability to build deployments on AWS with build scripts (Boto3 and the AWS CLI) and automated solutions using shell scripting.
  • Responsible for performing reads and writes in Cassandra from a web application using Java JDBC connectivity.
  • Worked with Azure Data Factory (ADF) alongside Ansible, Jenkins, Docker, Kubernetes, and DevOps/CI/CD automation to compose and orchestrate Azure data services as a SaaS solution.
  • Involved in file movements between HDFS and AWS S3 and extensively worked with S3 buckets in AWS.
  • Experienced in the Hadoop ecosystem components like Hadoop Map Reduce, Cloudera, Hortonworks, HBase, Oozie, Hive, Sqoop, Pig, Flume, Kafka, Storm, Spark, Splunk, MongoDB, and Cassandra.
  • Developed software to process, cleanse, and report on vehicle data using analytics and REST APIs built with Java, Scala, and the Akka asynchronous programming framework.
  • Involved in developing web services using REST, the HBase native API, and the BigSQL client to query data from HBase.
  • Extensively used ETL methodology to support data extraction, transformation, and loading in a corporate-wide ETL solution using SAP BW, with strong knowledge of OLAP, OLTP, Extended Star, Star, and Snowflake schema methodologies.
  • Experience in analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Java.
  • Good experience implementing advanced procedures like text analytics and processing with the in-memory computing capabilities of Apache Impala.
  • Used the Spark-Cassandra Connector to load data to and from Cassandra.
  • Used multiple Abinitio components like Transform, Reformat, Scan, Rollup, Partition components, Lookups, Filter Expressions, and Joins to perform ETL operations during design and development.
  • Created Abinitio plans for running the graphs.
  • Extensive knowledge in architecting Extract, Transform, and Load environments using the Abinitio suite of products (ACE, BRE, Metadata Hub).
  • Experience in Java, JSP, Servlets, EJB, Hibernate, Spring, JavaScript, Ajax, jQuery, XML, and HTML.
  • Hands-on practical experience with various Ab Initio components such as PDL, Psets, Vectors, Join, Rollup, Scan, Reformat, Partition by Key, round-robin, gather, merge, Dedup sorted, FTP, etc.
  • Developed import rules for importing various sources (EME Datasets, EME Graphs, Database) of metadata into the Metadata hub.
  • Experience in working with different cloud infrastructures like Amazon Web Services (AWS), Azure, and Google Cloud Platform (GCP).
  • Strong hands-on experience using major components in Hadoop Ecosystem like Spark, Map Reduce, HIVE, PIG, HBase, Sqoop, Splunk, Oozie, Flume, and Kafka.
  • Expertise in automating deployment of large Cassandra clusters on EC2 using EC2 APIs.
  • Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.
  • Leveraged AWS, Informatica Cloud, Snowflake Data Warehouse, HashiCorp platform, AutoSys, and Rally Agile/Scrum to implement Data Lake, Enterprise Data Warehouse, and advanced data analytics solutions based on data collection and integration from multiple sources (S3, SQL Server, Oracle, NoSQL, and mainframe systems). Worked on creating Kafka topics and partitions and writing custom partitioner classes.
  • Experienced in writing Spark Applications in Scala.
  • Imported Avro files using Apache Kafka and did some analytics using Spark Scala.
  • Experience in creating RDDs and DataFrames for the required data and performing transformations using Spark RDDs and Spark SQL (see the sketch after this list).
  • Deployed services such as Spark, MongoDB, and Cassandra in Kubernetes and Hadoop clusters using Docker. Worked on Google Cloud Platform (GCP) services including Kubernetes, Dataflow, Pub/Sub, and BigQuery.
  • Extensive knowledge in Java, J2EE, Servlets, JSP, JDBC, Struts, and the Spring Framework.
  • Developed Spark code using Scala and Spark SQL for faster processing and testing.
  • Expertise in the Scala programming language and Spark Core. Good experience in analysis using Hive and a working understanding of Sqoop.
  • Experience in importing and exporting data using Sqoop from RDBMS to HDFS and vice-versa
  • Involved in importing streaming data into HDFS using Flume and analyzing it using Pig and Hive.
  • Experienced in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
  • Strong working knowledge of Spark and Spark SQL.
  • Expertise in writing Hadoop jobs for data analysis using MapReduce and Hive.
  • Experience in using relational databases like Oracle, PostgreSQL, MySQL, Microsoft SQL Server, and DB2.
  • Flexible with Unix/Linux and Windows environments, working with operating systems like CentOS 5/6 and Ubuntu 13/14.
  • Performed advanced ETL development activities using Informatica, PL/SQL, Oracle Database tuning, and SQL tuning.
  • Analyzed the extraction of data from data partners, loading, cleansing, and validating the data using programming languages like PL/SQL, Shell Scripting, and python, and reported the cleansed data to the clients using Qlik Sense.
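
By way of illustration, below is a minimal sketch of the kind of DataFrame/Spark SQL transformation work referenced above; the HDFS path, view name, and columns (customer_events, customer_id, amount) are hypothetical placeholders rather than details from any specific engagement.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Minimal sketch: load raw records into a DataFrame, apply transformations,
// and expose the result to Spark SQL. Paths and column names are illustrative.
object CustomerEventSummary {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CustomerEventSummary")
      .getOrCreate()

    // Read raw CSV data from HDFS into a DataFrame (hypothetical path/schema).
    val events = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/raw/customer_events")

    // DataFrame transformations: filter invalid rows and aggregate per customer.
    val summary = events
      .filter(col("amount").isNotNull && col("amount") > 0)
      .groupBy(col("customer_id"))
      .agg(count("*").as("event_count"), sum("amount").as("total_amount"))

    // Register as a temp view so the same result can be queried with Spark SQL.
    summary.createOrReplaceTempView("customer_summary")
    spark.sql("SELECT customer_id, total_amount FROM customer_summary ORDER BY total_amount DESC LIMIT 20")
      .show()

    spark.stop()
  }
}
```

The temp view simply makes the same DataFrame result available to plain SQL consumers alongside the DataFrame API.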

TECHNICAL SKILLS

Hadoop/Big Data ecosystems: HDFS, MapReduce, Sqoop, Flume, Pig, Hive, Oozie, Impala, ZooKeeper, Cloudera Manager, Spark, Scala

NoSQL Database: HBase, Cassandra

Tools and IDEs: Eclipse, NetBeans, Toad, Putty, Maven, DB Visualizer, VS Code, Qlik Sense, Qlik View

Languages: SQL, PL/SQL, JAVA, Scala, Python

Databases: Oracle, SQL Server, MySQL, DB2, PostgreSQL, Teradata

Tracking Tools and Control: SVN, GIT, Maven

ETL Tools: OFSAA, IBM DataStage

Cloud Technologies: AWS, Azure

PROFESSIONAL EXPERIENCE

Confidential, Durham, NC

Hadoop/Bigdata Developer

Responsibilities:

  • Developed a data pipeline using Sqoop, Spark, MapReduce, and Hive to ingest, transform, and analyze customer behavioral data.
  • Implemented Spark using Python and Spark SQL for faster processing of data and algorithms for real-time analysis in Spark.
  • Experience in using Splunk and Apache Flume for collecting, aggregating, and moving large amounts of data from the application servers.
  • Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
  • Built a Scala- and Spark-based configurable framework to connect to common data sources like MySQL, Oracle, Postgres, SQL Server, and BigQuery and load the data into BigQuery.
  • Experienced in implementing real-time streaming and analytics using technologies such as Spark Streaming and Kafka.
  • Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver for all environments.
  • Selecting appropriate AWS services to design and deploy an application based on given requirements.
  • Develop efficient Scala programs to perform batch processes on huge unstructured datasets.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the data into HDFS.
  • Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
  • Worked on the Analytics Infrastructure team to develop a stream filtering system on top of Apache Kafka.
  • Working experience on the Snowflake Elastic Data Warehouse, cloud-based data warehousing for storing and analyzing data.
  • Provided automation and deployment of applications inside software containers, adding a layer of abstraction and automation of operating-system-level virtualization on Linux using Docker, Kubernetes, and Vagrant.
  • Good working knowledge of cloud services Amazon EC2, DynamoDB, API Gateway, S3, Athena, and GCP.
  • Created S3 buckets, managed policies for S3 buckets, and utilized S3 and Glacier for storage and backup on AWS.
  • Designed AWS CloudFormation templates to create VPCs, subnets, and NAT to ensure successful deployment of web applications and database templates.
  • Worked on AWS environments such as Lambda, serverless applications, EMR, Athena, AWS Glue, IAM policies, S3, CFT, and EC2.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Prototype for monitoring real-time application metrics/KPI using Influx DB and Grafana through Kafka and Spark.
  • Good exposure to GCP, GCS, BigQuery, Compute Engine, and ETL projects in GCP.
  • Hands-on experience with big data tools like Facebook Presto, Apache Drill, and Snowflake.
  • Experience in Docker and Kubernetes, Vagrant, Chef, Puppet, Ansible, Salt Stack, Jenkins.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS (see the sketch after this list).
  • Good knowledge of Kafka, ActiveMQ, and Spark Streaming for handling streaming data. Designed and implemented applications using Core Java, J2EE, JDBC, JSP, HTML, the Spring Framework, Spring Batch, Spring AOP, Struts, JavaScript, and Servlets.
  • Experience in data extraction into a DataStax Cassandra cluster from Oracle (RDBMS) using the Java driver or Sqoop.
  • Migration from relational databases (source systems), data warehouse (Teradata) to Big Data platforms on GCP.
  • Developed a job server (REST API, Spring Boot, Oracle DB) and a job shell for job submission, job profile storage, and job data (HDFS) query/monitoring.
  • Developed build and deployment scripts using Maven to customize WAR and EAR files.
  • Loaded RDBMS data from Oracle, PostgreSQL, and DB2 into HDFS using Sqoop.
  • Worked with the Trifacta tool to prepare and analyze data for analytics solutions.
  • Analyzed the data by performing Hive queries (Hive QL) to study customer behavior.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
  • Scheduled and executed workflows in Oozie to run Hive jobs and created UDF's to store specialized data structures in HBase and Cassandra.
  • Worked on writing Go (Golang) code to pull data from Kinesis and load it into Prometheus, which Grafana uses for reporting and visualization.
  • Created and maintained the configuration of the Spring MVC Framework IoC container for module services to access the unified API of these modules.
  • Used Hibernate DAO support for performing queries and handled transactions using Spring annotations.
  • Designing and implementing code to handle out-of-order messages and ensure exactly-once state semantics in Flink's distributed environment.
  • Developed many MapReduce jobs in native Java for pre-processing of the data.
  • Optimizing existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Tuned Spark/Scala code to improve the performance for data analysis.
  • Performed data validation on the data ingested using MapReduce by building a custom model to filter all the invalid data and cleanse the data.
  • Work heavily with Spring Boot and Apache Flink.
  • Developed interactive shell scripts for scheduling various data cleansing and data loading processes.
  • Continuous monitoring and managing the Hadoop cluster using Cloudera Manager.
  • Provide technology leadership and support in building solutions to meet customer needs.
  • Contribute to the design and development of new features and applications.
  • Review of functional specifications and other validation deliverables as assigned.
  • Work as the technical point person to interact with a customer in providing the technical solution and work with the internal architecture team to help design the solution.
  • Researching, designing, implementing, and managing software programs.
  • Development of technical specifications and plans.
  • Developing data models and pipelines in a manner to be scalable and be reusable.
  • Analyze user requirements and convert requirements to design documents.
  • Independently lead the development of microservice components that would fit into the enterprise information technology (IT) ecosystem.
  • Extensive usage of Struts, HTML, CSS, JSP, jQuery, AJAX, and JavaScript for interactive pages.
  • Analyzed data from multiple business sources; validated and migrated data based on customer requirements and business needs.
  • Knowledge in working with Azure cloud platform (Data Lake, Databricks, Blob Storage, Data Factory)
  • Worked with Spark using Scala for improving performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, Data Frame, Pair RDD's, Spark YARN.
  • Created Shell scripts to automate the data loading from HDFS to HIVE and perform sanity tests on data available in HIVE.
  • Java Mail API was used to notify the Agents about the free quote and for sending Emails to the Customer with Promotion Code for validation.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark data frames, Scala.
  • Fluent in multiple Big Data technologies and programming languages including but not limited to Spark, Scala, SQL, Oracle, Hive.
  • Performed transformations, cleaning, and filtering on imported data using Hive and loaded final data into HDFS.
  • Identifying and integrating several data sources and systems to make way for a faster and reliable platform.
  • Unit testing and functional testing of new functionalities and applications.
  • Providing ongoing support for deployed applications and implementing solutions based on business requirements.
  • Provide peer support to Software Engineers in the design, development, and implementation of new small system components.
  • Work as part of the global team providing support, guidance, and mentoring to the global teams.
  • Support customers and professional services as required to address any questions or resolve any issues related to the platform and Spark applications.
  • Performed advanced ETL development activities using Informatica, PL/SQL, Oracle Database tuning, and SQL tuning.
  • Analyzed the extraction of data from data partners, loading, cleansing, and validating the data using programming languages like PL/SQL, Shell Scripting, and python, and reported the cleansed data to the clients using Qlik Sense
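
As a companion to the Kafka/Spark Streaming bullet above, here is a minimal sketch of a direct-stream ingest from Kafka into HDFS using the spark-streaming-kafka-0-10 integration; the broker address, topic, consumer group, and landing path are hypothetical.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaToHdfs")
    val ssc  = new StreamingContext(conf, Seconds(30))   // 30-second micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",            // hypothetical broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "behavior-ingest",
      "auto.offset.reset"  -> "latest"
    )

    // Direct stream from a hypothetical topic.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("customer-events"), kafkaParams)
    )

    // Persist each non-empty micro-batch of raw messages to HDFS as text files.
    stream.map(_.value())
      .foreachRDD { rdd =>
        if (!rdd.isEmpty())
          rdd.saveAsTextFile(s"hdfs:///data/landing/events/${System.currentTimeMillis()}")
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```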

Environment: Hadoop, MapReduce, YARN, Spark, BigQuery, Hive, GCP, Pig, Kafka, HBase, Oozie, Sqoop, Python, Bash/Shell Scripting, Flume, Cassandra, Oracle, Core Java, Storm, HDFS, Unix, Teradata, NiFi, Eclipse

Confidential, Austin, TX

Big Data Developer

Responsibilities:

  • Research and recommend a suitable technology stack for Hadoop migration considering current enterprise architecture.
  • Expertise in writing Hadoop jobs for analyzing structured and unstructured data using HDFS, Hive, HBase, Pig, Spark, Kafka, Scala, Oozie, and Talend ETL.
  • Extensively used the Spark stack to develop preprocessing jobs, including the RDD, Dataset, and DataFrame APIs, to transform data for upstream consumption.
  • Built and maintained scalable data pipelines using the Hadoop ecosystem and other open-source components like Hive and HBase.
  • Good understanding of NoSQL databases and hands-on work experience in writing applications on NoSQL databases like Cassandra and MongoDB.
  • Good knowledge in querying data from Cassandra for searching, grouping, and sorting.
  • Involved in implementing and integrating various NoSQL databases like HBase and Cassandra.
  • Deployed Kubernetes in both AWS and Google Cloud: set up the cluster and replication and deployed multiple containers in a pod.
  • Using the GCP Console, monitored Dataproc clusters and jobs; used Stackdriver to monitor dashboards, performed performance tuning and optimization of memory-intensive jobs, and provided L3 support for applications in the production environment.
  • Reviewed and documented business logic in Informatica mappings and workflows to be migrated to GCP.
  • Experience in Google Cloud Platform architecture and BigQuery; proficient in designing efficient workflows.
  • Worked on scalable distributed data system using Hadoop ecosystem in AWS EMR, MapR distribution.
  • Good experience in Administering and configuring Kubernetes.
  • Migrated an existing on-premises application to AWS.
  • Used AWS services like EC2 and S3 for small data sets.
  • Used ZooKeeper for various types of centralized configuration, SVN for version control, Maven for project management, and Jira for internal bug/defect management.
  • Experience in analyzing data using HiveQL, Spark SQL, PostgreSQL, Pig Latin, and custom MapReduce programs in Java.
  • Worked on developing and designing solutions on GCP with HBase, Bigtable, and BigQuery.
  • Used GCP Stackdriver for monitoring and logging of Compute Engine and Dataproc.
  • Involved in analyzing data using Google BigQuery to discover information, business value, patterns, and trends in support of decision making; worked on data profiling and data quality analysis.
  • Involved in migrating Teradata queries to Snowflake data warehouse queries.
  • Involved in developing with Spring IoC (Inversion of Control), DAO, and MVC.
  • Used Spring support for RESTful web services to communicate with the host machine for agreement forms.
  • Wrote Go (Golang) code to pull data from Kinesis and load it into Prometheus, which Grafana uses for reporting and visualization.
  • Involved in modeling different key risk indicators in Splunk and building extensive Hive and SPL queries to understand behavior across the customer life cycle.
  • Converted existing snowflake-schema data into a star schema in Hive for building OLAP cubes in Kylin.
  • Worked on AWS Relational Database Service and AWS Security Groups and their rules, and implemented reporting and notification services using AWS APIs.
  • Implemented AWS EC2, key pairs, security groups, Auto Scaling, ELB, SQS, and SNS using AWS APIs and exposed them as RESTful web services.
  • Experience with the Splunk Enterprise Security (ES) app, performing data investigation and data analysis to address security vulnerabilities, incidents, and penetration techniques.
  • Implemented an AWS data lake leveraging S3, Terraform, Vagrant/Vault, EC2, Lambda, VPC, and IAM for data processing and storage, while writing complex SQL queries and analytical and aggregate functions on views in the Snowflake data warehouse to develop near-real-time visualizations using Tableau Desktop/Server 10.4 and Alteryx.
  • Work heavily with Spring Boot and Apache Flink.
  • Used Kafka and Spark Streaming for streaming purposes.
  • Working on a new system architecture to replace the client's current credit trading platform using Flink.
  • Involved in developing JDBC DAOs and DTOs and accessing advanced SQL and PL/SQL stored procedures on database systems using Spring templates.
  • Onboarding of new data into Splunk. Troubleshooting Splunk and optimizing performance.
  • Developed multiple MapReduce jobs in java for data cleaning and pre-processing.
  • Worked on extracting and enriching relational database data between multiple tables using joins in Spark.
  • Worked on writing APIs to load the processed data to Hive tables.
  • Experienced in using Scala, Spark Streaming, and Akka for ongoing customer transactions.
  • Replaced the existing MapReduce programs with Spark applications written in Scala (see the sketch after this list).
  • Developed the Hive UDF's to handle data quality and create filtered datasets for further processing
  • Experienced in writing Sqoop scripts to import data into Hive/HDFS from RDBMS.
  • Set up Spark on EMR to process huge data stored in Amazon S3.
  • Excellent experience using TextMate on Ubuntu for writing Java, Scala, and shell scripts.
  • Expert in implementing advanced procedures like text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
  • Involved in writing MapReduce programs using Java.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Used Talend tool to create workflows for processing data from multiple source systems.
  • Tested Apache TEZ, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Supported MapReduce programs running on the cluster and also wrote MapReduce jobs using the Java API.
  • Optimized Hive QL scripts by using execution engines like Tez, Spark.
  • Developed Hive queries to analyze the data in HDFS to identify issues and behavioral patterns.
  • Able to use Python Pandas, NumPy modules for Data analysis, Data scraping, and parsing.
  • Deployed applications using Jenkins, integrating Git version control with it.
  • Participated in production support regularly to support the Analytics platform
  • Used GIT for version control.
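
To illustrate the MapReduce-to-Spark migration mentioned above, the sketch below shows a typical mapper/reducer-style count rewritten as a Spark RDD job in Scala; the input path, delimiter, and field positions are assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of replacing a MapReduce aggregation with a Spark RDD job.
// Input path, delimiter, and field positions are illustrative assumptions.
object LogCountBySource {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LogCountBySource").getOrCreate()
    val sc = spark.sparkContext

    // Equivalent of the old mapper: parse each line and emit (key, 1).
    val pairs = sc.textFile("hdfs:///data/raw/app_logs")
      .map(_.split('\t'))
      .filter(_.length > 2)               // drop malformed records
      .map(fields => (fields(1), 1L))     // field 1 assumed to hold the source system

    // Equivalent of the old reducer: sum the counts per key.
    val counts = pairs.reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/processed/log_counts_by_source")
    spark.stop()
  }
}
```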

Environment: Hadoop, HDFS, AWS, Hive, BigQuery, Spark SQL, GCP, MapReduce, Spark Streaming, Abinitio, Sqoop, Oozie, Jupyter Notebook, Docker, Kafka, Spark, Scala, Talend, Shell Scripting.

Confidential, Atlanta, GA

Big Data Developer

Responsibilities:

  • Performed advanced procedures like text analytics and processing using the in-memory computing capabilities of Spark.
  • Developed Spark code using Scala and Spark SQL for faster processing and testing.
  • Worked on Spark SQL for joining multiple Hive tables and writing them to a final Hive table.
  • Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data to HDFS using Scala.
  • Worked on scalable distributed data system using Hadoop ecosystem in AWS EMR and MapR (MapR data platform).
  • Developed prototype Spark applications using Spark Core, Spark SQL, and the DataFrame API, and developed several custom user-defined functions in Hive and Pig using Java and Python.
  • Imported data into Spark from a Kafka consumer group using the Spark Streaming APIs.
  • Good Knowledge of reporting and data visualization tools like Oracle Data Visualization Desktop, Tableau, and Grafana.
  • Worked on Google Cloud Platform (GCP) services like the Vision API and instances.
  • Used Informatica Cloud Data Integration, together with Azure, Ansible, Jenkins, Docker, Kubernetes, and DevOps/CI/CD automation, for global distributed data warehouse and analytics projects.
  • Utilized Azure Data Factory, along with Ansible, Jenkins, Docker, Kubernetes, and CI/CD automation, to create, schedule, and manage data pipelines.
  • Developed a POC for project migration from on prem Hadoop MapR system to GCP/Snowflake.
  • Built clusters in the AWS environment using EMR with S3, EC2, and Redshift.
  • Built dashboards and visualizations on top of MapR-DB and Hive using Oracle data visualizer desktop. Built real-time visualizations on top of Open TSDB using Grafana.
  • Used DataStax Cassandra along with Pentaho for reporting.
  • Designed, configured, and deployed Amazon Web Services (AWS) for a multitude of applications utilizing the AWS stack (Including EC2, Glue, Data pipeline EMR, SNS, S3, RDS, Cloud Watch, SQS, IAM), focusing on high-availability, fault tolerance, and auto-scaling.
  • Worked with AWS Glue jobs to transform data to a format that optimizes query performance for Athena.
  • Working on a new system architecture to replace the client's current credit eTrading platform using Flink.
  • Implemented Spark RDD transformations to map business analysis requirements and applied actions on top of the transformations.
  • Created Spark jobs to do lightning-speed analytics over the Spark cluster.
  • Evaluated Spark's performance vs. Impala on transactional data.
  • Experienced in developing scripts for doing transformations using Scala.
  • Experienced in creating data pipelines integrating Kafka with Spark Streaming applications, using Scala to write the applications.
  • Developed customized UDFs in Java for extending Pig and Hive functionality.
  • Used Spark SQL for reading data from external sources and processing the data using the Scala computation framework.
  • Used Spark transformations and aggregations to perform min, max, and average on transactional data (see the sketch after this list).
  • Extracted files from databases through Sqoop and placed in HDFS and processed through spark.
  • Experienced in migrating Hive QL into Impala to minimize query response time.
  • Experience using Impala for data processing on top of HIVE for better utilization.
  • Developed and optimized Pig and Hive UDFs to implement methods and functionality of Java as required.
  • Wrote queries using Cassandra CQL to create, alter, insert, and delete elements.
  • Hands-on experience working on NoSQL databases including HBase, MongoDB, and Cassandra, and their integration with the Hadoop cluster.
  • Implemented AWS EC2, key pairs, security groups, Auto Scaling, ELB, SQS, and SNS using AWS APIs, exposed them as RESTful web services, and implemented reporting and notification services using AWS APIs.
  • Performed querying of both managed and external tables created by Hive using Impala.
  • Developed Impala scripts for end-user/analyst requirements for Adhoc analysis.
  • Continuous monitoring and managing the Hadoop cluster through Cloudera Manager.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Used HTML, CSS, XML, JavaScript, and JSP for interactive cross-browser functionality and a complex user interface.
  • Experience in writing custom UDFs for Hive to incorporate methods and functionality of Java into HQL (Hive SQL).
  • Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS.
  • Responsible for creating Hive tables, loading with data, and writing Hive queries.
  • Optimized Hive QL by using execution engines like Tez and Spark.
  • Responsible for creating mappings and workflows to extract and load data from relational databases, flat file sources, and legacy systems using Abinitio.
  • Fetched and generated monthly reports and visualized those reports using Tableau.
  • Used Oozie Workflow engine to run multiple Hive jobs.
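
The following sketch illustrates the min/max/average aggregations on transactional data mentioned above, expressed with Spark SQL functions; the Hive table and column names (transactions, account_id, amount) are assumed for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Minimal sketch of min/max/average aggregations over transactional data.
// The Hive table and columns (transactions, account_id, amount) are assumed.
object TransactionStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TransactionStats")
      .enableHiveSupport()
      .getOrCreate()

    val txns = spark.table("transactions")

    // Aggregate per account: minimum, maximum, and average transaction amount.
    val stats = txns
      .groupBy("account_id")
      .agg(
        min("amount").as("min_amount"),
        max("amount").as("max_amount"),
        avg("amount").as("avg_amount")
      )

    // Write the aggregated result back to a Hive table for downstream reporting.
    stats.write.mode("overwrite").saveAsTable("transaction_stats")
    spark.stop()
  }
}
```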

Environment: Hadoop, Cloudera, Flume, HBase, GCP, HDFS, MapReduce, YARN, Hive, Sqoop, Oozie, Tableau, Abinitio, JUnit, agile methodologies, UNIX

Confidential

Hadoop/ETL Developer

Responsibilities:

  • Implemented ETL Abinitio designs and processes for loading data from the sources to the target warehouse.
  • Responsible for performance tuning at various levels like mapping level, session level, and database level.
  • Worked with Abinitio components and parallelism concepts.
  • Scheduled various daily and monthly ETL loads using scheduler tools.
  • Worked on developing UNIX Scripts.
  • Experience migrating data between HDFS and RDBMS using Sqoop and also exporting and importing using streaming platforms Flume and Kafka.
  • Used the BTEQ, MLOAD, and FLOAD utilities of Teradata and UNIX shell scripts.
  • Created tags and save files to migrate Abinitio projects/objects across environments.
  • Worked with Session Logs and Workflow Logs for Error handling and troubleshooting.
  • Developed UDFs in Java for Hive and Pig and worked on reading multiple data formats on HDFS using Scala.
  • Performed data masking and ETL processes using S3, Informatica Cloud, Informatica PowerCenter, and Informatica Test Data Management to support a Snowflake data warehousing solution in the cloud.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Developed multiple POCs using Scala and deployed on the Yarn cluster, compared the performance of Spark, with Hive and SQL/Teradata.
  • Analyzed the SQL scripts and designed the solution to implement them using Scala.
  • Pulled data from Oracle and landed it on HDFS as Avro files, then converted the Avro files to Parquet; to resolve performance issues, loaded the data into Hive/Impala as Parquet files (see the sketch after this list).
  • Pulled data from web services (HTTP proxy), cleansed it, and kept it on HDFS, then applied logic through Talend (ETL) and pushed it back into HDFS for access from Hive and Impala.
  • Migrated data from Oracle to the Data Lake using Sqoop, Spark, and Talend (ETL tool).
  • Used Job Conductor to deploy the job (.zar) files and to schedule and monitor the jobs.
  • Migrated code from the development to the production environment using Nexus, scheduling jobs to run in production through CONTROL-M.
  • Implemented Storm builder topologies to perform cleansing operations before moving data into Cassandra.
  • Loaded data from different sources (databases and files) into Hive using the Talend tool (standard, MapReduce, and Spark jobs), monitored system health and logs, and responded to any warning or failure conditions.
  • Built data systems and data pipelines that extract, classify, merge, and deliver new insights.
  • Ingested, aggregated, loaded, and transformed large data sets of structured, semi-structured, and unstructured data into Hadoop (the data lake).
  • Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
  • Implemented Hive UDFs for evaluation, filtering, loading, and storing of data.
  • Worked with partitioning, dynamic partitions, and buckets in Hive.
  • Migrated data from relational databases (Oracle, Teradata) and external data to HDFS using Sqoop.
  • Designed both managed and external tables in Hive to optimize performance; to improve performance further, used auto map joins, avoided skew joins, optimized the LIMIT operator, enabled parallel execution, enabled MapReduce strict mode, and used a single reducer for multi-group-by queries.
  • Implemented various loads like Daily Loads, Weekly Loads, and Quarterly Loads using Incremental Loading Strategy.
  • Responsible for Unit Testing and creating unit test plans and preparing unit test cases.
  • Interacted with the system testing team and resolved issues reported by the testing team.
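
As a rough illustration of the Avro-to-Parquet conversion step described above, the sketch below reads Avro files landed on HDFS and rewrites them as Parquet; it assumes the spark-avro package is on the classpath, and the paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the Avro-to-Parquet conversion step (requires the spark-avro package).
// Paths are illustrative; the Parquet output is then exposed to Hive/Impala as an external table.
object AvroToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AvroToParquet").getOrCreate()

    // Read the Avro files that the ingest step landed on HDFS.
    val avroDf = spark.read.format("avro").load("hdfs:///data/landing/orders_avro")

    // Rewrite the same data as Parquet, which Hive/Impala query far more efficiently.
    avroDf.write
      .mode("overwrite")
      .parquet("hdfs:///data/warehouse/orders_parquet")

    spark.stop()
  }
}
```

An external Hive table pointing at the Parquet directory can then be queried from Impala after a REFRESH or INVALIDATE METADATA picks up the newly written files.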

Environment: Abinitio, Hadoop, Unix, UNIX Scripts, agile methodologies, Teradata, Hive, SQL, Shell Script.
