Senior Big Data Engineer Resume

Detroit, MI

SUMMARY

  • Overall 8+ years of professional experience in Information Technology, with expertise in Big Data using the Hadoop framework and in analysis, design, development, testing, documentation, deployment and integration using SQL and Big Data technologies.
  • Excellent understanding/knowledge of Hadoop architecture and its components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node and the MapReduce programming paradigm.
  • Understanding of the Hadoop architecture and its ecosystem, including HDFS, YARN, MapReduce, Kafka, Sqoop, Avro, Spark, Spark SQL, Spark Streaming, Hive, HBase, Impala, Pig, Oozie, Hue, Flume and Zookeeper.
  • Transforming and retrieving data using Spark, Impala, Pig, Hive, SSIS and MapReduce.
  • Streaming data from various sources, including cloud (AWS, Azure) and on-premises systems, using Spark and Flume.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala (see the sketch after this summary).
  • Importing and exporting data with Sqoop between HDFS and relational database systems.
  • Extensively using the open-source languages Perl, Python, Scala and Java.
  • Extensive hands-on experience with distributed computing architectures such as AWS products (e.g., EC2, Redshift, EMR and Elasticsearch), Hadoop, Python and Spark, and effective use of Azure SQL Database, MapReduce, Hive, SQL and PySpark to solve Big Data problems.
  • Experience in Hadoop Streaming and writing MapReduce jobs in Perl and Python as well as Java.
  • Excellent knowledge of, and extensive experience using, WebHDFS REST API commands.
  • Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and transforming complex data in Scala using in-memory computing.
  • Worked with Spark to improve the efficiency of existing algorithms using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs and Spark on YARN.
  • Experience in automation and building CI/CD pipelines using Jenkins and Chef.
  • Developed generic SQL procedures and complex T-SQL statements for report generation.
  • Hands-on experience in data modeling with star schema and snowflake schema.
  • Excellent knowledge of the Business Intelligence tools SSIS, SSAS, SSRS, Informatica and Power BI.
  • Designed and implemented data distribution mechanisms on SQL Server (transactional, snapshot and merge replication, SSIS and DTS).
  • Experience in Microsoft Azure/cloud services such as SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage and Azure Data Factory.
  • Good knowledge of data marts, OLAP and dimensional data modeling with the Ralph Kimball methodology (star schema and snowflake modeling for fact and dimension tables) using Analysis Services.
  • Designed and implemented high availability and disaster recovery systems on SQL Server (Always On, mirroring and log shipping).
  • Hands-on experience with SQL Server failover clusters in the active/passive model.
  • Excellent knowledge of database/data warehousing concepts such as normalization, entity-relationship modeling, dimensional data modeling, schemas and metadata.
  • Monitored data activities (database status, logs, space utilization, extents, checkpoints, locks and long transactions) and applied improvements.
  • Expertise in writing complex SQL queries; made use of indexing, aggregation and materialized views to optimize query performance.
  • Developed Spark applications that can handle data from various RDBMSs (MySQL, Oracle Database) and streaming sources.
  • Excellent knowledge of and extensive experience with NoSQL databases such as HBase, MongoDB and Cassandra.
  • Excellent knowledge of Confidential Azure services, Amazon Web Services and their management.
  • Side-by-side upgrades, in-place upgrades and data migration.
  • Incident management, SLA management, TSG maintenance and FTM improvement.
  • Effectively planned and managed project deliverables with an on-site and offshore model and improved client satisfaction.
  • Responsible for team goal setting and timely feedback to improve team performance.
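
As referenced above, a minimal Scala sketch of manipulating DataFrames with Spark SQL and comparing the result against the equivalent HiveQL query; the database, table and column names (sales.orders, customer_id, amount, status) are hypothetical placeholders, not taken from an actual engagement.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object OrdersSummary {
      def main(args: Array[String]): Unit = {
        // Hive support lets the same session run both the DataFrame and the HiveQL version.
        val spark = SparkSession.builder()
          .appName("orders-summary")
          .enableHiveSupport()
          .getOrCreate()

        // Hypothetical Hive table; any table with these columns works.
        val orders = spark.table("sales.orders")

        // DataFrame API: total completed-order amount per customer.
        val summary = orders
          .filter(col("status") === "COMPLETE")
          .groupBy(col("customer_id"))
          .agg(sum(col("amount")).alias("total_amount"))

        // Same logic in Spark SQL for a side-by-side performance comparison.
        orders.createOrReplaceTempView("orders_v")
        val summarySql = spark.sql(
          """SELECT customer_id, SUM(amount) AS total_amount
            |FROM orders_v
            |WHERE status = 'COMPLETE'
            |GROUP BY customer_id""".stripMargin)

        summary.show(10)
        summarySql.show(10)
        spark.stop()
      }
    }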

TECHNICAL SKILLS

Big Data/Hadoop Ecosystem: HDFS, MapReduce, YARN, Hive, Pig, HBase, Kafka, Impala, Zookeeper, Sqoop, Oozie, DataStax & Apache Cassandra, Drill, Flume, Spark, Solr and Avro

Web Technologies: HTML, XML, JDBC, JSP, JavaScript, AJAX

RDBMS: Oracle 12c, MySQL, SQL Server, Teradata

NoSQL: HBase, Cassandra, MongoDB

Web/Application servers: Tomcat, LDAP

Methodologies: Agile, UML, Design Patterns (Core Java and J2EE)

Cloud Environment: AWS, MS Azure

Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans

Programming Languages: Scala, Python, SQL, Java, PL/SQL, Linux shell scripts.

Tools Used: Eclipse, PuTTY, Cygwin, MS Office

BI Tools: Platfora, Tableau, Pentaho

PROFESSIONAL EXPERIENCE

Confidential, Detroit, MI

Senior Big Data Engineer

Responsibilities:

  • Created data pipelines for gathering, cleaning and optimizing data using Hive and Spark.
  • Gathered data stored in AWS S3 from various third-party vendors, optimized it and joined it with internal datasets to derive meaningful information.
  • Implemented a generic, highly available ETL framework for bringing related data into Hadoop and Cassandra from various sources using Spark.
  • Experienced in using Platfora, a data visualization tool specific to Hadoop, and created various Lenses and Vizboards for real-time visualization from Hive tables.
  • Built data pipelines with Data Fabric jobs, Sqoop, Spark, Scala and Kafka, while working in parallel on the database side with Oracle and MySQL Server on source-to-target data design.
  • Used the Hadoop-on-cloud service Qubole to process data in AWS S3 buckets.
  • Programmed in Java and Scala.
  • Built a continuous integration and deployment pipeline using Jenkins and Chef.
  • Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
  • Participated in various upgrade and troubleshooting activities across the enterprise.
  • Knowledge of performance troubleshooting and tuning of Hadoop clusters.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java for event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
  • Applied advanced Spark procedures such as text analytics and processing using in-memory processing.
  • Implemented Apache Drill on Hadoop to join data from SQL and NoSQL databases and store it in Hadoop.
  • Created an architecture stack blueprint for data access with the NoSQL database Cassandra.
  • Brought data from various sources into Hadoop and Cassandra using Kafka.
  • Experienced in using Tidal Enterprise Scheduler and Oozie operational services for coordinating the cluster and scheduling workflows.
  • Applied Spark Streaming for real-time data transformation.
  • Created multiple dashboards in Tableau for multiple business needs.
  • Used partitioning and bucketing in Hive to optimize queries.
  • Stored data in ORC, Parquet and Avro file formats with compression.
  • Moved data between cloud and on-premises Hadoop using DistCp and a proprietary ingest framework.
  • Installed and configured Hive, wrote Hive UDFs and used Piggybank, a repository of UDFs for Pig Latin.
  • Implemented partitioning, dynamic partitions and buckets in Hive for efficient data access.
  • Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team using Tableau.
  • Implemented Composite Server for data virtualization needs and created multiple views for restricted data access using a REST API.
  • Queried and analyzed data from Cassandra for quick searching, sorting and grouping through CQL.
  • Implemented various data modeling techniques for Cassandra.
  • Combined various datasets in Hive to generate business reports.
  • Scheduled workflows, coordinators and bundles using Oozie.
  • Used the Spark DataFrame API in Scala for analyzing data (see the sketch after this list).
  • Devised and led the implementation of a next-generation architecture for more efficient data ingestion and processing.
  • Created and implemented various shell scripts for automating jobs.
  • Implemented Apache Sentry to restrict access to Hive tables at the group level.
  • Employed the Avro format for the entire data ingestion for faster operation and lower space utilization.
  • Experienced in managing and reviewing Hadoop log files.
  • Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.
  • Worked with enterprise data support teams to install Hadoop updates, patches and version upgrades as required, and fixed problems that arose after the upgrades.
  • Used Jenkins and Maven as build tools.
  • Implemented test scripts to support test-driven development and continuous integration.
  • Used Spark for parallel data processing and better performance.
  • Wrote research reports describing the experiments conducted, results and findings, and made strategic recommendations to technology, product and senior management. Worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python Jupyter Notebooks, Hive and NoSQL.
  • Wrote production-level machine learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes within a given time frame.
  • Performed all necessary day-to-day Git support for different projects; responsible for the design and maintenance of the Git repositories and the access control strategies.
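
As referenced in the Spark DataFrame bullet above, a minimal Scala sketch of joining a third-party feed landed in S3 with an internal Hive dataset and writing compressed Parquet; the bucket names, paths, table and column names are hypothetical placeholders.

    import org.apache.spark.sql.SparkSession

    object VendorJoinJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("vendor-join")
          .enableHiveSupport()
          .getOrCreate()

        // Third-party vendor feed landed in S3 (path is a placeholder).
        val vendor = spark.read
          .option("header", "true")
          .csv("s3a://vendor-bucket/daily/2019-06-01/")

        // Internal dataset already curated in Hive.
        val accounts = spark.table("curated.accounts")

        // Join on the shared key, keep only the fields needed downstream,
        // and write Snappy-compressed Parquet back to S3.
        vendor.join(accounts, Seq("account_id"), "inner")
          .select("account_id", "region", "vendor_score")
          .write
          .mode("overwrite")
          .option("compression", "snappy")
          .parquet("s3a://analytics-bucket/vendor_scores/")

        spark.stop()
      }
    }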

Environment: Hadoop, MapReduce, Hortonworks, Spark, Zeppelin, Qubole, Oozie, Hive, Impala, Kafka, Sqoop, AWS, Cassandra, Tableau, Pig, Teradata, Java, Scala, Jenkins, Maven, Chef, Python, Linux Red Hat, Git.

Confidential, Plano, TX

Big Data Engineer

Responsibilities:

  • Designed and developed data collectors and parsers using Perl and Python.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Designed and developed parsers for different file formats (CSV, XML, binary, ASCII, text, etc.).
  • Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics).
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks (see the sketch after this list).
  • Implemented Copy activities and custom Azure Data Factory pipeline activities.
  • Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS and PowerShell.
  • Designed and developed dashboards in Zoom-Data and wrote complex queries and data aggregations.
  • Extensive use of the Cloudera Hadoop distribution.
  • Shell programming and crontab automation.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and pre-processing.
  • Experienced in installing, configuring and using Hadoop ecosystem components.
  • Experienced in importing and exporting data into HDFS and Hive using Sqoop.
  • Participated in the development/implementation of the Cloudera Hadoop environment.
  • Imported and exported data from various sources through scripts and Sqoop.
  • Extensive use of Spark for data streaming and data transformation for real-time analytics.
  • Extensively used WebHDFS REST API commands in Perl scripting.
  • Big Data management in Hive and Impala (tables, partitioning, ETL, etc.).
  • Extensive use of Hue and other Cloudera tools.
  • Experienced in running queries using Impala and used BI tools to run ad-hoc queries directly on Hadoop.
  • Integrated Cassandra as a distributed persistent metadata store to provide metadata resolution for network entities on the network.
  • Involved in implementing and integrating various NoSQL databases such as HBase and Cassandra.
  • Installed and configured Hive, wrote Hive UDFs, and used MapReduce and JUnit for unit testing.
  • Used DataStax Cassandra along with Pentaho for reporting.
  • Queried and analyzed data from DataStax Cassandra for quick searching, sorting and grouping.
  • Experienced in working with various kinds of data sources such as Teradata and Oracle. Successfully loaded files into HDFS from Teradata, and loaded data from HDFS into Hive and Impala.
  • Designed and implemented a product search service using Apache Solr/Lucene.
  • Worked on installing the cluster, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning and slot configuration.
  • Used the YARN architecture and MapReduce 2.0 in the development cluster for a POC.
  • Supported MapReduce programs running on the cluster. Involved in loading data from the UNIX file system into HDFS.
  • Loaded and transformed large sets of structured, semi-structured and unstructured data.
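
As referenced in the Azure Databricks bullet above, a minimal Scala sketch of the ingest-and-process pattern: raw CSV is read from an ADLS Gen2 container, cleaned, and written as partitioned Parquet for downstream queries. The storage account, container and folder names are hypothetical, and the sketch assumes storage credentials are already configured on the cluster.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object IngestToLake {
      def main(args: Array[String]): Unit = {
        // On Databricks a SparkSession already exists; getOrCreate reuses it.
        val spark = SparkSession.builder().appName("ingest-to-lake").getOrCreate()

        // Raw CSV drop zone in ADLS Gen2 (placeholder account and container).
        val raw = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("abfss://raw@examplelake.dfs.core.windows.net/events/")

        // Basic cleanup: drop rows missing the key and stamp the load date.
        val cleaned = raw
          .na.drop(Seq("event_id"))
          .withColumn("load_date", current_date())

        // Write partitioned Parquet to the curated zone for Databricks/Hive queries.
        cleaned.write
          .mode("append")
          .partitionBy("load_date")
          .parquet("abfss://curated@examplelake.dfs.core.windows.net/events/")
      }
    }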

Environment: Hadoop, Cloudera, MapReduce, Kafka, Impala, Spark, Azure, Azure Databricks, Azure Data Factory, Azure Data Lake, Zeppelin, Hue, Pig, Hive, Sqoop, Java, Scala, Cassandra, SQL, Tableau, Zookeeper, Teradata, Zoom-Data, Linux Red Hat and Oracle.

Confidential, Chicago, IL

Big Data Engineer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop and migrating legacy applications to Hadoop.
  • Wrote Spark code in Scala to connect to HBase and read/write data to HBase tables.
  • Extracted data from different databases and copied it into HDFS using Sqoop; expertise in using compression techniques to optimize data storage.
  • Implemented Kafka producers that create custom partitions, configured brokers and implemented high-level consumers to build the data platform.
  • Delivered a real-time experience and analyzed massive amounts of data from multiple sources to calculate real-time ETAs using Confluent Kafka event streaming (see the sketch after this list).
  • Developed the technical strategy of using Apache Spark on Apache Mesos as a next-generation Big Data and "Fast Data" (streaming) platform.
  • Implemented the Flume and Spark frameworks for real-time data processing.
  • Developed simple to complex MapReduce jobs using Hive and Pig for analyzing the data.
  • Optimized MapReduce jobs to use HDFS efficiently by applying various compression mechanisms.
  • Developed a Big Data ingestion framework to process multi-TB data, including data quality checks and transformations, stored in efficient formats such as Parquet and loaded into Amazon S3 using the Spark Scala API.
  • Worked on cloud computing infrastructure (e.g., Amazon Web Services EC2) and considerations for scalable, distributed systems.
  • Created Spark Streaming code to take source files as input.
  • Used Oozie workflows to automate all the jobs.
  • Deployed an Apache Solr/Lucene search engine server to help speed up the search of financial documents.
  • Exported the analyzed data into relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Developed Spark programs using Scala, was involved in creating Spark SQL queries and developed Oozie workflows for Spark jobs.
  • Built analytics for structured and unstructured data and managed large data ingestion using Avro, Flume, Thrift, Kafka and Sqoop.
  • Developed Pig UDFs to understand customer behavior and Pig Latin scripts for processing the data in Hadoop.
  • Scheduled automated tasks with Oozie for loading data into HDFS through Sqoop and pre-processing the data with Pig and Hive.
  • Worked on scalable distributed computing systems, software architecture, data structures and algorithms using Hadoop, Apache Spark, Apache Storm, etc.
  • Ingested streaming data into Hadoop using Spark, the Storm framework and Scala.
  • Copied data from HDFS to MongoDB using Pig/Hive/MapReduce scripts and visualized the stream-processed data in Tableau dashboards.
  • Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
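
As referenced in the Confluent Kafka bullet above, a minimal Scala sketch of a Spark Structured Streaming job that consumes trip events from Kafka and maintains a windowed average ETA; the broker address, topic, schema and checkpoint path are hypothetical placeholders.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    object TripEtaStream {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("trip-eta-stream").getOrCreate()

        // Schema of the hypothetical JSON trip-event payload.
        val schema = new StructType()
          .add("trip_id", StringType)
          .add("eta_minutes", DoubleType)
          .add("event_time", TimestampType)

        // Consume events from Kafka (broker and topic names are placeholders).
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "trip-events")
          .load()
          .select(from_json(col("value").cast("string"), schema).as("e"))
          .select("e.*")

        // Average ETA per 5-minute window, tolerating 10 minutes of late data.
        val avgEta = events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window(col("event_time"), "5 minutes"))
          .agg(avg("eta_minutes").alias("avg_eta"))

        avgEta.writeStream
          .outputMode("update")
          .format("console")
          .option("checkpointLocation", "/tmp/trip-eta-checkpoint")
          .start()
          .awaitTermination()
      }
    }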

Environment: Hadoop, Spark, Scala, HBase, AWS, EC2, S3, Oozie, Spark Streaming, Pig, Kafka, MongoDB, Hive, MapReduce, Flume.

Confidential

Hadoop Developer

Responsibilities:

  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Evaluated the suitability of Hadoop and its ecosystem for the project, implementing/validating various proof-of-concept (POC) applications to eventually adopt them and benefit from the Big Data Hadoop initiative.
  • Estimated the software and hardware requirements for the NameNode and DataNodes and planned the cluster.
  • Extracted the needed data from the server into HDFS and bulk-loaded the cleaned data into HBase.
  • Lead role in NoSQL column family design, client access software and Cassandra tuning during migration from Oracle-based data stores.
  • Implemented a data streaming capability using Kafka and Talend for multiple data sources.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python (see the sketch after this list).
  • Built large-scale data processing systems for data warehousing solutions and worked with unstructured data mining on NoSQL.
  • Specified the cluster size, allocated the resource pool and the Hadoop distribution by writing the specifications in JSON file format.
  • Designed, implemented and deployed, within a customer's existing Hadoop/Cassandra cluster, a series of custom parallel algorithms for various customer-defined metrics and unsupervised learning models.
  • Using the Spark framework, enhanced and optimized product Spark code to aggregate, group and run data mining tasks.
  • Wrote queries using DataStax Cassandra CQL to create, alter, insert and delete elements.
  • Wrote MapReduce programs and Hive UDFs in Java.
  • Used JUnit for unit testing of MapReduce programs.
  • Developed Hive queries for the analysts.
  • The custom file system plugin allows Hadoop MapReduce programs, HBase, Pig and Hive to work unmodified and access files directly.
  • Configured Hadoop tools such as Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
  • Wrote Hive queries for analyzing data in the Hive warehouse using Hive Query Language (HQL).
  • Queried both managed and external tables created by Hive using Impala.
  • Created an e-mail notification service that runs upon job completion for the particular team that requested the data.
  • Defined job workflows as per their dependencies in Oozie.
  • Played a key role in productionizing the application after testing by BI analysts.
  • Gave a POC of Flume to handle real-time log processing for attribution reports.
  • Maintained system integrity of all sub-components related to Hadoop.
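
As referenced in the Hive-to-Spark bullet above, a minimal Scala sketch of converting a HiveQL aggregate into equivalent Spark RDD transformations; the database, table and column names (hr.employees, dept) are hypothetical placeholders.

    import org.apache.spark.sql.SparkSession

    object HiveToRddExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-to-rdd")
          .enableHiveSupport()
          .getOrCreate()

        // Original HiveQL: SELECT dept, COUNT(*) FROM hr.employees GROUP BY dept
        val hiveResult = spark.sql(
          "SELECT dept, COUNT(*) AS cnt FROM hr.employees GROUP BY dept")

        // The same aggregation expressed as Spark RDD transformations.
        val deptCounts = spark.table("hr.employees")
          .rdd
          .map(row => (row.getAs[String]("dept"), 1L))
          .reduceByKey(_ + _)

        hiveResult.show()
        deptCounts.take(20).foreach(println)
        spark.stop()
      }
    }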

Environment: Apache Hadoop, HDFS, Spark, Solr, Hive, DataStax Cassandra, MapReduce, Pig, Java, Flume, Cloudera CDH4, Oozie, Oracle, MySQL, Amazon S3.

Confidential

Data Analyst

Responsibilities:

  • Resolved issues related to the enterprise data warehouse (EDW) and stored procedures in the OLTP system, and analyzed, designed and developed ETL strategies.
  • Identified performance issues in existing sources, targets and mappings by analyzing the data flow and evaluating transformations, and tuned them accordingly for better performance.
  • Worked with heterogeneous sources to extract data from Oracle databases, XML and flat files and load it into a relational Oracle warehouse.
  • Troubleshot standard and reusable mappings and mapplets using various transformations such as Expression, Aggregator, Joiner, Router, Lookup (connected and unconnected) and Filter.
  • Performed tuning of SQL queries and stored procedures for speedy data extraction to resolve and troubleshoot issues in the OLTP environment.
  • Troubleshot long-running sessions and fixed the related issues.
  • Worked with variables and parameters in the mappings to pass values between sessions.
  • Involved in the development of PL/SQL stored procedures, functions and packages to process business data in the OLTP system.
  • Worked with the Services and Portal teams on various occasions to address data issues in the OLTP system.
  • Worked with the testing team to resolve bugs related to day-one ETL mappings before production.
  • Created weekly project status reports, tracked the progress of tasks against the schedule and reported any risks and contingency plans to management and business users.

Environment: Informatica PowerCenter, Oracle, PL/SQL, SQL Developer, ETL, OLTP, XML, Toad.
